Model-Free Control
In model-free control, we do not rely on a model of the environment: the model may be unknown, incomplete, or too expensive to compute with. Instead, we learn the policy or the action-value function directly from experience.
Prediction is the process of estimating the value function of a given policy, while control is the process of finding the optimal policy. For control, the objective is the action-value function rather than the state-value function, since acting greedily with respect to it needs no model.
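The action-value function used as the control objective can be written out with its standard definition (expected discounted return when starting from state s, taking action a, and following policy π thereafter):

```latex
q_\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s,\ A_t = a\right],
\qquad
q_*(s, a) = \max_\pi q_\pi(s, a)
```

Control methods aim at q_*, from which the optimal policy is obtained by acting greedily.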
On-Policy Control: On-policy control methods learn about the same policy they are following. The policy is updated based on the action-value function.
Off-Policy Control: Off-policy control methods learn a target policy that is different from the policy they are following. The target policy is updated based on the action-value function, while the behavior policy is the one that generates the experience.
Policy evaluation is the process of estimating the value function for a given policy. Policy improvement is the process of finding a better policy based on the value function.
Greedy Policy Improvement over Value Function:
- requires a model of the environment (a one-step lookahead over the transition dynamics is needed to pick the best action)

Greedy Policy Improvement over Action-Value Function:
- does not require a model of the environment
- uses the action-value function directly: pick the action with the highest Q(s, a)
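A minimal sketch of model-free improvement over a tabular action-value function (assuming Q is stored as a NumPy array of shape (n_states, n_actions); the ε-greedy variant is the usual way to keep exploration):

```python
import numpy as np

def greedy_policy(Q):
    """Greedy policy improvement over an action-value table.

    Q: array of shape (n_states, n_actions).
    Returns, for each state, the action maximizing Q(s, a).
    No transition model is needed -- only Q itself.
    """
    return np.argmax(Q, axis=1)

def epsilon_greedy_action(Q, state, epsilon, rng):
    """epsilon-greedy selection: random action with prob. epsilon,
    otherwise the greedy action for this state."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))
```

With ε = 0 this reduces to pure greedy improvement; a small ε > 0 keeps every action visited, which the convergence arguments for the methods below rely on.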
SARSA
SARSA is an on-policy, model-free control method that learns the action-value function; its name comes from the (S, A, R, S', A') tuple used in each update.
Every time step, policy evaluation is done with the SARSA update, and policy improvement is done with ε-greedy policy improvement.
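The per-step evaluation can be sketched as the standard tabular SARSA update (α and γ are assumed step-size and discount hyperparameters; Q is any indexable table):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA update:
    Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A)).

    On-policy: a_next is the action actually chosen by the current
    (e.g. epsilon-greedy) policy in s_next.
    """
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q
```

Because the bootstrap term Q(S', A') uses the action the agent will actually take, evaluation and the followed policy stay coupled, which is exactly what makes the method on-policy.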
n-step SARSA: the update uses the n-step return, which sums n discounted rewards and then bootstraps from Q.
Forward View SARSA(λ): the λ-return combines the returns of all step lengths, weighted geometrically by λ.
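Written out, the returns used above are (standard definitions):

```latex
q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n}
          + \gamma^{n} Q(S_{t+n}, A_{t+n}),
\qquad
q_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}
```

For λ = 0 the λ-return reduces to the one-step SARSA target; as λ → 1 it approaches the full Monte Carlo return.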
Q-Learning
Q-learning is an off-policy control method. It is a model-free method that learns the action-value function.
The next action is selected by the behavior policy, but the bootstrapped value in the update comes from the target policy.
Both behavior and target policies are improved: the target policy with greedy policy improvement, and the behavior policy with ε-greedy policy improvement.
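The off-policy update can be sketched in the same tabular style as the SARSA snippet above (α and γ again assumed hyperparameters):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning update:
    Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a' Q(S',a') - Q(S,A)).

    Off-policy: the bootstrap takes the max over actions (the greedy
    target policy), regardless of which action the epsilon-greedy
    behavior policy actually takes in s_next.
    """
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q
```

The only difference from the SARSA update is the max over next actions: SARSA evaluates the action the behavior policy took, while Q-learning evaluates the greedy target policy.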
#MMI706 - Reinforcement Learning at METU